Content
Introduction
Problem Statement and Dataset
Data Preprocessing and Cleaning
Exploratory Data Analysis
Base model with all variables
Kruskal-Wallis test on categorical variables
Feature Selection and Feature Engineering
Modelling
Testing assumptions in Linear Regression for final model
Suggestions for additional data
Business Solution
References
Approach
CLV is the total worth to a business of a customer over the whole period of their relationship. It’s an important metric as it costs less to keep existing customers than it does to acquire new ones, so increasing the value of your existing customers is a great way to drive growth.
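As a concrete illustration of the metric (not the modelling approach used later in this report), one common textbook simplification treats CLV as a discounted geometric series of per-period margins. All numbers below are made up; the report's analysis itself was done in R.

```python
# Illustrative CLV approximation: per-period margin times expected retention,
# discounted. Inputs here are invented for demonstration only.
def simple_clv(margin_per_period, retention_rate, discount_rate):
    """CLV = margin * r / (1 + d - r): the closed form of the geometric
    series for a constant per-period margin and retention probability r."""
    return margin_per_period * retention_rate / (1 + discount_rate - retention_rate)

# Example: $100 margin per month, 90% retention, 1% monthly discount rate.
print(round(simple_clv(100, 0.90, 0.01), 2))
```

This is only a back-of-the-envelope formula; the report instead predicts CLV directly from customer attributes with regression models.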
CLV helps the organization in:
The task is to understand the problem of Customer Lifetime Value and provide a data-driven solution through data methodologies: to understand customer behaviour, to identify the profitable customers, and to predict the lifetime value of a customer to facilitate targeted marketing.
The dataset delineates the demographics and the specifics of the policies purchased by customers of a bank/firm. Customer Lifetime Value (Customer.Lifetime.Value) is the target variable to be predicted. A preview of the dataset:
## Customer State Customer.Lifetime.Value Response Coverage Education
## 1 BU79786 Washington 2763.519 No Basic Bachelor
## 2 QZ44356 Arizona 6979.536 No Extended Bachelor
## 3 AI49188 Nevada 12887.432 No Premium Bachelor
## 4 WW63253 California 7645.862 No Basic Bachelor
## Effective.To.Date EmploymentStatus Gender Income Location.Code Marital.Status
## 1 2/24/11 Employed F 56274 Suburban Married
## 2 1/31/11 Unemployed F 0 Suburban Single
## 3 2/19/11 Employed F 48767 Suburban Married
## 4 1/20/11 Unemployed M 0 Suburban Married
## Monthly.Premium.Auto Months.Since.Last.Claim Months.Since.Policy.Inception
## 1 69 32 5
## 2 94 13 42
## 3 108 18 38
## 4 106 18 65
## Number.of.Open.Complaints Number.of.Policies Policy.Type Policy
## 1 0 1 Corporate Auto Corporate L3
## 2 0 8 Personal Auto Personal L3
## 3 0 2 Personal Auto Personal L3
## 4 0 7 Corporate Auto Corporate L2
## Renew.Offer.Type Sales.Channel Total.Claim.Amount Vehicle.Class Vehicle.Size
## 1 Offer1 Agent 384.8111 Two-Door Car Medsize
## 2 Offer3 Agent 1131.4649 Four-Door Car Medsize
## 3 Offer1 Agent 566.4722 Two-Door Car Medsize
## 4 Offer1 Call Center 529.8813 SUV Medsize
The variables available in the dataset are:
## [1] "Customer" "State"
## [3] "Customer.Lifetime.Value" "Response"
## [5] "Coverage" "Education"
## [7] "Effective.To.Date" "EmploymentStatus"
## [9] "Gender" "Income"
## [11] "Location.Code" "Marital.Status"
## [13] "Monthly.Premium.Auto" "Months.Since.Last.Claim"
## [15] "Months.Since.Policy.Inception" "Number.of.Open.Complaints"
## [17] "Number.of.Policies" "Policy.Type"
## [19] "Policy" "Renew.Offer.Type"
## [21] "Sales.Channel" "Total.Claim.Amount"
## [23] "Vehicle.Class" "Vehicle.Size"
The dimension of the dataset is (9134, 24).
There are 15 categorical variables (one of which is the customer ID), 8 numerical variables (including the target variable) and one date variable. The structure of these variables can be inferred here:
## 'data.frame': 9134 obs. of 24 variables:
## $ Customer : chr "BU79786" "QZ44356" "AI49188" "WW63253" ...
## $ State : chr "Washington" "Arizona" "Nevada" "California" ...
## $ Customer.Lifetime.Value : num 2764 6980 12887 7646 2814 ...
## $ Response : chr "No" "No" "No" "No" ...
## $ Coverage : chr "Basic" "Extended" "Premium" "Basic" ...
## $ Education : chr "Bachelor" "Bachelor" "Bachelor" "Bachelor" ...
## $ Effective.To.Date : chr "2/24/11" "1/31/11" "2/19/11" "1/20/11" ...
## $ EmploymentStatus : chr "Employed" "Unemployed" "Employed" "Unemployed" ...
## $ Gender : chr "F" "F" "F" "M" ...
## $ Income : int 56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
## $ Location.Code : chr "Suburban" "Suburban" "Suburban" "Suburban" ...
## $ Marital.Status : chr "Married" "Single" "Married" "Married" ...
## $ Monthly.Premium.Auto : int 69 94 108 106 73 69 67 101 71 93 ...
## $ Months.Since.Last.Claim : int 32 13 18 18 12 14 0 0 13 17 ...
## $ Months.Since.Policy.Inception: int 5 42 38 65 44 94 13 68 3 7 ...
## $ Number.of.Open.Complaints : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Number.of.Policies : int 1 8 2 7 1 2 9 4 2 8 ...
## $ Policy.Type : chr "Corporate Auto" "Personal Auto" "Personal Auto" "Corporate Auto" ...
## $ Policy : chr "Corporate L3" "Personal L3" "Personal L3" "Corporate L2" ...
## $ Renew.Offer.Type : chr "Offer1" "Offer3" "Offer1" "Offer1" ...
## $ Sales.Channel : chr "Agent" "Agent" "Agent" "Call Center" ...
## $ Total.Claim.Amount : num 385 1131 566 530 138 ...
## $ Vehicle.Class : chr "Two-Door Car" "Four-Door Car" "Two-Door Car" "SUV" ...
## $ Vehicle.Size : chr "Medsize" "Medsize" "Medsize" "Medsize" ...
The summary statistics of the numerical variables are:
## vars n mean sd median
## Customer* 1 9134 4567.500000 2.636903e+03 4567.5000
## State* 2 9134 2.741734 1.287771e+00 2.0000
## Customer.Lifetime.Value 3 9134 8004.940475 6.870968e+03 5780.1822
## Response* 4 9134 1.143201 3.502971e-01 1.0000
## Coverage* 5 9134 1.480622 6.558173e-01 1.0000
## Education* 6 9134 2.554084 1.381978e+00 2.0000
## Effective.To.Date* 7 9134 30.057915 1.696353e+01 30.0000
## EmploymentStatus* 8 9134 2.825706 1.347793e+00 2.0000
## Gender* 9 9134 1.490037 4.999281e-01 1.0000
## Income 10 9134 37657.380009 3.037990e+04 33889.5000
## Location.Code* 11 9134 1.979089 6.057325e-01 2.0000
## Marital.Status* 12 9134 2.120210 6.368385e-01 2.0000
## Monthly.Premium.Auto 13 9134 93.219291 3.440797e+01 83.0000
## Months.Since.Last.Claim 14 9134 15.097000 1.007326e+01 14.0000
## Months.Since.Policy.Inception 15 9134 48.064594 2.790599e+01 48.0000
## Number.of.Open.Complaints 16 9134 0.384388 9.103835e-01 0.0000
## Number.of.Policies 17 9134 2.966170 2.390182e+00 2.0000
## Policy.Type* 18 9134 1.825925 4.759888e-01 2.0000
## Policy* 19 9134 4.797788 1.605978e+00 5.0000
## Renew.Offer.Type* 20 9134 1.970221 1.007576e+00 2.0000
## Sales.Channel* 21 9134 2.102693 1.069452e+00 2.0000
## Total.Claim.Amount 22 9134 434.088794 2.905001e+02 383.9454
## Vehicle.Class* 23 9134 3.036019 2.176820e+00 1.0000
## Vehicle.Size* 24 9134 2.089556 5.373128e-01 2.0000
## trimmed mad min max
## Customer* 4.567500e+03 3385.5171 1.000000 9134.00
## State* 2.692939e+00 1.4826 1.000000 5.00
## Customer.Lifetime.Value 6.631050e+03 3658.8991 1898.007675 83325.38
## Response* 1.054050e+00 0.0000 1.000000 2.00
## Coverage* 1.363027e+00 0.0000 1.000000 3.00
## Education* 2.466201e+00 1.4826 1.000000 5.00
## Effective.To.Date* 3.005761e+01 22.2390 1.000000 59.00
## EmploymentStatus* 2.712644e+00 0.0000 1.000000 5.00
## Gender* 1.487548e+00 0.0000 1.000000 2.00
## Income 3.572792e+04 42522.4506 0.000000 99981.00
## Location.Code* 1.973864e+00 0.0000 1.000000 3.00
## Marital.Status* 2.150246e+00 0.0000 1.000000 3.00
## Monthly.Premium.Auto 8.734031e+01 26.6868 61.000000 298.00
## Months.Since.Last.Claim 1.467761e+01 11.8608 0.000000 35.00
## Months.Since.Policy.Inception 4.786932e+01 35.5824 0.000000 99.00
## Number.of.Open.Complaints 1.325944e-01 0.0000 0.000000 5.00
## Number.of.Policies 2.541461e+00 1.4826 1.000000 9.00
## Policy.Type* 1.855638e+00 0.0000 1.000000 3.00
## Policy* 4.931582e+00 1.4826 1.000000 9.00
## Renew.Offer.Type* 1.837849e+00 1.4826 1.000000 4.00
## Sales.Channel* 2.003421e+00 1.4826 1.000000 4.00
## Total.Claim.Amount 4.026767e+02 213.5753 0.099007 2893.24
## Vehicle.Class* 2.920088e+00 0.0000 1.000000 6.00
## Vehicle.Size* 2.111932e+00 0.0000 1.000000 3.00
## range skew kurtosis se
## Customer* 9133.000 0.000000000 -1.2003941 2.759076e+01
## State* 4.000 0.210353443 -1.2522205 1.347436e-02
## Customer.Lifetime.Value 81427.374 3.031284401 13.8116290 7.189313e+01
## Response* 1.000 2.036897794 2.1491880 3.665271e-03
## Coverage* 2.000 1.030971447 -0.1071182 6.862026e-03
## Education* 4.000 0.333038262 -1.3708988 1.446008e-02
## Effective.To.Date* 58.000 -0.007240518 -1.1809683 1.774948e-01
## EmploymentStatus* 4.000 0.831874159 -1.0269468 1.410239e-02
## Gender* 1.000 0.039852474 -1.9986306 5.230908e-03
## Income 99981.000 0.286793057 -1.0948011 3.178747e+02
## Location.Code* 2.000 0.009507714 -0.2774384 6.337973e-03
## Marital.Status* 2.000 -0.107517379 -0.5805584 6.663445e-03
## Monthly.Premium.Auto 237.000 2.122849037 6.1875462 3.600216e-01
## Months.Since.Last.Claim 35.000 0.278494819 -1.0741586 1.053997e-01
## Months.Since.Policy.Inception 99.000 0.040151771 -1.1334913 2.919893e-01
## Number.of.Open.Complaints 5.000 2.782348975 7.7420578 9.525634e-03
## Number.of.Policies 8.000 1.252921117 0.3615648 2.500924e-02
## Policy.Type* 2.000 -0.468229157 0.4983517 4.980423e-03
## Policy* 8.000 -0.427498589 0.1984827 1.680386e-02
## Renew.Offer.Type* 3.000 0.717161897 -0.6293888 1.054259e-02
## Sales.Channel* 3.000 0.506607401 -1.0369589 1.119002e-02
## Total.Claim.Amount 2893.141 1.714402582 5.9735064 3.039595e+00
## Vehicle.Class* 5.000 0.254371988 -1.7698517 2.277677e-02
## Vehicle.Size* 2.000 0.072717859 0.3436722 5.622076e-03
There are no missing values in our dataset. This can be inferred from the following graph:
The Effective.To.Date column contains two different date formats: "1/20/11" and "02-03-2011". We have cleaned this column and converted both formats into a single "YYYY-MM-DD" format.
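The cleaning itself was done in R; the same idea can be sketched in Python, assuming (as the examples in the report suggest) that both source formats are month-first:

```python
from datetime import datetime

def to_iso(date_str):
    """Try each known input format in turn and return an ISO 'YYYY-MM-DD'
    string. Both formats are assumed month-first."""
    for fmt in ("%m/%d/%y", "%m-%d-%Y"):
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {date_str!r}")

print(to_iso("1/20/11"))     # 2011-01-20
print(to_iso("02-03-2011"))  # 2011-02-03
```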
Questions asked before approaching EDA:
The above graph shows that the target variable, Customer Lifetime Value, has a positively skewed distribution.
The boxplot of the target variable shows quite a few extreme values to the right, which may also be observed in real-life cases. This can be adjusted through transformations such as a log transformation.
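A quick pure-Python sketch of why a log transformation helps: on made-up right-skewed values (a few very large customers, loosely reminiscent of CLV), the sample skewness shrinks substantially after logging.

```python
import math

def skewness(xs):
    """Sample skewness: mean cubed deviation over the standard deviation cubed."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

# Illustrative right-skewed values; not actual rows from the dataset.
clv = [2000, 2500, 3000, 3500, 4000, 5000, 6000, 8000, 20000, 80000]
logged = [math.log(x) for x in clv]
print(round(skewness(clv), 2), round(skewness(logged), 2))
```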
We can infer that the distributions of CLV for customers with education below high school and with a Master's degree differ somewhat from those of the other customers. But the plot also shows that education does not affect the target variable to a great extent.
The above two plots suggest that the customers who opted for Premium and Extended policy coverage are more valuable to the company than those with Basic coverage.
This plot shows the distribution of CLV for customers based on the state they belong to. The distributions are more or less similar, which indicates that the customer's state is not a valuable feature for determining their importance to the company.
The above frequency distribution plot suggests that the target variable, CLV, has a more or less similar distribution irrespective of whether the customer's response to marketing calls is positive or not.
The above plot shows the correlation between the numerical features of the dataset. The "0" values indicate that there is no correlation between Total.Claim.Amount and the variables Income, Number.of.Policies and Number.of.Open.Complaints. We can also see that no pair of variables is highly correlated, either with the target variable or with each other.
The above plots show little difference in income between the genders. Married and divorced customers have a greater income than single customers, and employed customers have a higher income than customers with other employment statuses. The last plot suggests that customers living in urban and rural residential areas have a greater income than those in suburban areas.
The above plots show the education levels of customers broken down by Gender (male, female), Employment status, Marital status and Location code.
The above four plots suggest that there is not much difference in Number.of.Policies across features like Gender, Marital status, Employment status and Location code.
The above four plots suggest that there is not much difference in CLV across features like Gender, Marital status, Employment status and Location code.
This plot compares CLV in terms of Gender, Employment status, Marital status and Residential area.
The bar plot above could inform a new strategy for how the company offers vehicle insurance renewals by gender. Overall there is little difference between female and male customers in their decision to renew; the exception is Offer type 4, where males show more interest in renewing their vehicle insurance than females.
The mean of MPA (Monthly.Premium.Auto) is 93.21929 and the median is 84.00. The variance of MPA is 1183.908 and the standard deviation is 34.40797. MPA has a positive correlation of 39.62% with CLV. The kurtosis is 6.187546, well above that of a normal distribution, so MPA has heavier tails than normal; its distribution is also positively skewed.
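These descriptive statistics were produced in R. For reference, the correlation and excess-kurtosis formulas behind them can be sketched in Python; the inputs below are toy sanity checks, not the report's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

def excess_kurtosis(xs):
    """Fourth standardised moment minus 3 (0 for a normal distribution)."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / (n * var ** 2) - 3

# Sanity checks: a perfectly linear pair, and a maximally light-tailed sample.
print(pearson([1, 2, 3], [2, 4, 6]))      # 1.0
print(excess_kurtosis([-1, 1, -1, 1]))    # -2.0
```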
The scatter plot shows that CLV increases as MPA increases.
The positive correlation value close to zero shows that there is no strong relationship between Months.Since.Last.Claim and CLV.
Customers with the Personal policy type seem to be more valuable to the company than those with the Corporate and Special types.
We had already established that customers with the Personal policy type were more valuable. Digging deeper, we see that Personal L3 customers are highly valuable to the company.
According to the above two plots, customers with 2 policies have a higher CLV than the others. This also suggests that the policy they opted for the first time was satisfactory, hence they applied for a second one. Customers with more than 3 policies show a similar pattern in their CLV values, which may mean the exact count beyond three is not significant in our model; this is visible in the second plot. A possible approach while modelling is to bin the Number.of.Policies variable, for example combining all customers with more than 3 policies into a single group.
Most customers have no complaints about the service provided. The mean CLV for customers with fewer than 3 complaints follows a similar pattern, so the customers with more complaints, who also have a relatively low mean CLV, should be looked after for better profit for the organization. The same binning approach suggested for Number.of.Policies can also be applied here.
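The binning suggested in the two paragraphs above can be sketched as follows. This is a hypothetical helper, not code from the report; the cut-offs (keep 1 and 2 distinct, merge 3 and above) follow the observations made here.

```python
def bin_policies(n):
    """Collapse Number.of.Policies into coarse groups: 1 and 2 stay separate
    (their CLV profiles differ sharply), while 3+ are merged into one bin."""
    if n <= 2:
        return str(n)
    return "3+"

# First few Number.of.Policies values from the dataset preview.
print([bin_policies(n) for n in [1, 8, 2, 7, 1]])
```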
This bar plot combines policy type and policy for visualization purposes. Personal policy L3 has the highest density among them.
This pie chart shows the proportion of Vehicle.Class. Around 50% of the customers in this dataset drive four-door cars, which may suggest a high share of middle-class customers.
This pie chart shows the proportion of Vehicle.Size: 70.3% of the customers have medium-size vehicles, which supports the previous claim.
The above two plots show the distribution of CLV in terms of Vehicle.Size and Vehicle.Class. Customers who own luxury cars have a relatively higher CLV than others, as expected, but there is a larger number of regular middle-class customers.
This plot shows that CLV increases with the total claim amount across the different policies, but the relationship is not perfectly linear.
This plot suggests that the company should treat agents and branches as its primary sales channels rather than the call center and the web. Marketing through call centers and the web should be improved, which could lead to more profit for the company.
Train Result
## [1] "MAE : 3871.59106307986"
## [1] "RMSE : 6241.96232879225"
## [1] "MSE : 38962093.7140616"
## [1] "R2 : 0.172241959232048"
## [1] "Adjusted R2 : 0.169741866302766"
Test Result
## [1] "MAE : 3922.20512370246"
## [1] "RMSE : 6410.31417826123"
## [1] "MSE : 41092127.8640169"
## [1] "R2 : 0.139383348571966"
## [1] "Adjusted R2 : 0.128888023554551"
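The MAE, RMSE, MSE, R-squared and adjusted R-squared values reported throughout were computed in R. For reference, the underlying formulas can be sketched in Python; the inputs below are illustrative toy values, not model output.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """MAE, RMSE, MSE, R2 and adjusted R2 from actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalises the number of predictors.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MAE": mae, "RMSE": rmse, "MSE": mse, "R2": r2, "AdjR2": adj_r2}

m = regression_metrics([3, 5, 7, 9], [2.5, 5.5, 7, 9.5], n_features=1)
print(m)
```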
The Kruskal-Wallis test is a nonparametric method, and hence does not assume a normal distribution of the residuals, unlike the analogous one-way analysis of variance. The test can be implemented in R using the kruskal.test(x, g) function, where x is a continuous (interval/ratio) variable and g is the categorical variable representing the groups to which the continuous values belong. The test does not identify where stochastic dominance occurs or for how many pairs of groups it obtains. ANOVA gives similar results here.
The null hypothesis of the Kruskal-Wallis test is that all the groups represented by the samples have similar distributions (informally, similar medians).
Here let us see how the test works with categorical variables with two examples.
State
##
## Kruskal-Wallis rank sum test
##
## data: df$Customer.Lifetime.Value and as.factor(df$State)
## Kruskal-Wallis chi-squared = 5.0721, df = 4, p-value = 0.28
## df$State df$Customer.Lifetime.Value
## 1 Arizona 7861.341
## 2 California 8003.648
## 3 Nevada 8056.707
## 4 Oregon 8077.901
## 5 Washington 8021.472
The p-value is 0.28 (> 0.05), so we fail to reject the null hypothesis: there is no significant difference in CLV among customers from different states. The State variable can therefore be dropped from our model. The table of group means also shows no discernible difference in mean CLV across states.
Coverage
##
## Kruskal-Wallis rank sum test
##
## data: df$Customer.Lifetime.Value and as.factor(df$Coverage)
## Kruskal-Wallis chi-squared = 502.5, df = 2, p-value < 2.2e-16
## df$Coverage df$Customer.Lifetime.Value
## 1 Basic 7190.706
## 2 Extended 8789.678
## 3 Premium 10895.603
The p-value here is < 0.05, so we reject the null hypothesis: there is a significant difference in CLV among customers with different policy coverage. The Coverage variable may therefore add value to our model. Observing the mean CLV for each coverage level, Extended and Premium policy customers are an asset to the organization.
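As a reference for how kruskal.test arrives at its statistic, the H statistic can be computed by hand. The Python sketch below uses toy groups and ignores the tie correction that R applies when values repeat.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic without tie correction:
    H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1),
    where R_i is the sum of the pooled ranks falling in group i."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    n = len(pooled)
    return (12.0 / (n * (n + 1))
            * sum(rs ** 2 / len(g) for rs, g in zip(rank_sums, groups))
            - 3 * (n + 1))

# Two clearly separated toy groups: all ranks of group 2 exceed group 1's.
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6]]))
```

The statistic is then compared against a chi-squared distribution with (number of groups - 1) degrees of freedom, which is where the p-values above come from.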
Therefore, based on the Kruskal-Wallis test, the categorical variables that would help in predicting CLV are:
Feature Engineering.
We have implemented different models with different transformations of the variables for predicting Customer Lifetime Value. The transformations attempted on the numerical variables include:
Feature selection.
Relative importance is a technique specific to linear regression models.
It can be used to assess how much each variable contributed to the model's R-squared: if you sum the produced importances, they add up to the model's R-squared value. In essence it is not directly a feature selection method, because the features that go into the model have already been chosen, but after building the model the relaimpo package can give a sense of how important each feature is in contributing to the R-squared, in other words in explaining the Y variable. You build a linear regression model and pass it as the main argument to calc.relimp(). relaimpo offers multiple ways to compute relative importance; the recommended method is type='lmg', which is what we used for feature selection.
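The 'lmg' method averages each predictor's increment to R-squared over all orderings of the predictors. The pure-Python sketch below (illustrative two-predictor data, not the report's model) reproduces that idea; since the number of orderings grows factorially, this brute-force version only suits a handful of predictors.

```python
import itertools

def solve(a, b):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(y, columns):
    """R^2 of an OLS fit of y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [c[i] for c in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    my = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def lmg(y, predictors):
    """Average each predictor's R^2 increment over all entry orderings
    (relaimpo's 'lmg'); the shares sum to the full model's R^2."""
    k = len(predictors)
    shares = [0.0] * k
    perms = list(itertools.permutations(range(k)))
    for perm in perms:
        used, prev = [], 0.0
        for idx in perm:
            used.append(predictors[idx])
            cur = r_squared(y, used)
            shares[idx] += cur - prev
            prev = cur
    return [s / len(perms) for s in shares]

# Toy data: y is an exact linear function of x1 and x2.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [0, 1, 0, 1, 0, 1]
y = [2 * a + 3 * b + 1 for a, b in zip(x1, x2)]
shares = lmg(y, [x1, x2])
print([round(s, 3) for s in shares])
```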
Stepwise Forward and Backward Selection.
Stepwise selection searches for the best possible regression model by iteratively adding and dropping variables to arrive at the model with the lowest AIC. It can be implemented using the step() function, which needs a lower model (a base model from which no features will be removed) and an upper model (a full model containing all candidate features).
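A single forward step of this AIC-driven search can be sketched as follows. This is pure Python on toy data, assuming a Gaussian AIC up to a constant; R's step() additionally considers dropping terms and iterates until no move lowers the AIC.

```python
import math

def fit_simple(y, x):
    """OLS fit of y on a single predictor; returns the residual sum of squares."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

def gaussian_aic(rss, n, k):
    """AIC for a Gaussian linear model, up to a constant: n*ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k

def forward_step(y, candidates):
    """One forward step: add whichever single predictor lowers AIC the most
    relative to the intercept-only model (or None, if nothing improves it)."""
    n = len(y)
    my = sum(y) / n
    base_rss = sum((yi - my) ** 2 for yi in y)
    best_name, best_aic = None, gaussian_aic(base_rss, n, 1)
    for name, x in candidates.items():
        aic = gaussian_aic(fit_simple(y, x), n, 2)
        if aic < best_aic:
            best_name, best_aic = name, aic
    return best_name

# Toy data: y depends on x1; x2 is an unrelated candidate.
x1 = list(range(1, 11))
noise = [0.1, -0.1] * 5
y = [2 * v + e for v, e in zip(x1, noise)]
x2 = [5, 1, 4, 2, 8, 3, 9, 7, 6, 0]
print(forward_step(y, {"x1": x1, "x2": x2}))
```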
Here we have used all the features available.
Train Result
## [1] "MAE : 3871.59106307986"
## [1] "RMSE : 6241.96232879225"
## [1] "MSE : 38962093.7140616"
## [1] "R2 : 0.172241959232048"
## [1] "Adjusted R2 : 0.169741866302766"
Test Result
## [1] "MAE : 3922.20512370246"
## [1] "RMSE : 6410.31417826123"
## [1] "MSE : 41092127.8640169"
## [1] "R2 : 0.139383348571966"
## [1] "Adjusted R2 : 0.128888023554551"
Train Result
## [1] "MAE : 3871.42551414636"
## [1] "RMSE : 6245.92550222668"
## [1] "MSE : 39011585.3793656"
## [1] "R2 : 0.171190498183599"
## [1] "Adjusted R2 : 0.168687229507053"
Test Result
## [1] "MAE : 3919.26780041461"
## [1] "RMSE : 6408.03198748838"
## [1] "MSE : 41062873.9526743"
## [1] "R2 : 0.139996030477955"
## [1] "Adjusted R2 : 0.129508177191101"
Based on the EDA and feature selection methods, the variables used in building the polynomial regression model are: Income, Monthly.Premium.Auto, poly(Number.of.Policies, 5), Coverage, EmploymentStatus, Renew.Offer.Type, Vehicle.Size, Vehicle.Class, Marital.Status and Education.
Train Result
## [1] "MAE : 2584.50345837054"
## [1] "RMSE : 4306.71655779153"
## [1] "MSE : 18547807.5091557"
## [1] "R2 : 0.605947849800001"
## [1] "Adjusted R2 : 0.604757686798299"
Test Result
## [1] "MAE : 2662.66955678632"
## [1] "RMSE : 4423.50642411934"
## [1] "MSE : 19567409.0842251"
## [1] "R2 : 0.590188219531591"
## [1] "Adjusted R2 : 0.585190514891732"
The variables used in the XGBoost model are: Vehicle.Class, Coverage, Renew.Offer.Type, EmploymentStatus, Policy.Type, Monthly.Premium.Auto, Number.of.Open.Complaints, Total.Claim.Amount, Number.of.Policies, Income and Education.
Train Result
## [1] "MAE : 1316.14598626418"
## [1] "RMSE : 2909.08410449344"
## [1] "MSE : 8462770.3270164"
## [1] "R2 : 0.823283591115292"
## [1] "Adjusted R2 : 0.823017233140815"
Test Result
## [1] "MAE : 1633.98400708824"
## [1] "RMSE : 3640.70099583986"
## [1] "MSE : 13254703.7411094"
## [1] "R2 : 0.701888792154881"
## [1] "Adjusted R2 : 0.700079066279441"
Feature Importance
## Feature Gain Cover Frequency
## 1: Number.of.Policies 0.5890670138 0.117021206 0.080800772
## 2: Monthly.Premium.Auto 0.2939819705 0.313251959 0.247708635
## 3: Total.Claim.Amount 0.0454546654 0.218050711 0.178967680
## 4: Income 0.0291492083 0.204920679 0.165701881
## 5: Number.of.Open.Complaints 0.0078809364 0.038151506 0.068982151
## 6: Policy.TypePersonal Auto 0.0036945356 0.004585525 0.017366136
## 7: Renew.Offer.TypeOffer2 0.0035931683 0.001976270 0.020742885
## 8: EducationCollege 0.0035505528 0.001706600 0.010371442
## 9: CoverageExtended 0.0032933913 0.006331472 0.024602026
## 10: CoveragePremium 0.0025567925 0.005135678 0.011336228
## 11: Renew.Offer.TypeOffer4 0.0022332308 0.003534000 0.011336228
## 12: Vehicle.ClassTwo-Door Car 0.0019955431 0.000944072 0.014471780
## 13: EducationHigh School or Below 0.0019754082 0.003549416 0.006753497
## 14: Vehicle.ClassSports Car 0.0016069991 0.007375634 0.014712976
## 15: EmploymentStatusEmployed 0.0015136332 0.019128789 0.031114327
## 16: Vehicle.ClassSUV 0.0014643702 0.001547376 0.014471780
## 17: Renew.Offer.TypeOffer3 0.0011973708 0.007706278 0.012059817
## 18: EmploymentStatusMedical Leave 0.0011792107 0.001354098 0.005065123
## 19: Vehicle.ClassLuxury Car 0.0008404111 0.001586722 0.014712976
## 20: EmploymentStatusRetired 0.0007782971 0.009556687 0.009889050
## 21: EmploymentStatusUnemployed 0.0007681527 0.004514426 0.016160154
## 22: Policy.TypeSpecial Auto 0.0006328992 0.006314675 0.003859141
## 23: Vehicle.ClassLuxury SUV 0.0005620279 0.001372735 0.006753497
## 24: EducationDoctor 0.0005215955 0.001138501 0.004582730
## 25: EducationMaster 0.0005086154 0.019244986 0.007477086
## Feature Gain Cover Frequency
The final model takes in the variables:
Monthly.Premium.Auto, I(Coverage == “Premium” ), Total.Claim.Amount, Income, I(Number.of.Policies == 1), I(Number.of.Policies == 2), I(EmploymentStatus == “Employed”), I(Policy.Type == “Special Auto”), I(Number.of.Open.Complaints == 3), I(Number.of.Open.Complaints == 4), I(Number.of.Open.Complaints == 5), I( Vehicle.Class== “Sports Car”).
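Each I(...) term in the R formula is a 0/1 indicator column. The following sketch shows the equivalent construction in Python; the rows are hypothetical, mirroring a few dataset columns.

```python
# Hypothetical rows shaped like the dataset's columns.
rows = [
    {"Coverage": "Premium", "Number.of.Policies": 2, "Vehicle.Class": "SUV"},
    {"Coverage": "Basic",   "Number.of.Policies": 1, "Vehicle.Class": "Sports Car"},
]

def indicator(rows, column, value):
    """0/1 column equivalent to R's I(column == value) formula term."""
    return [1 if r[column] == value else 0 for r in rows]

print(indicator(rows, "Coverage", "Premium"))          # [1, 0]
print(indicator(rows, "Number.of.Policies", 2))        # [1, 0]
print(indicator(rows, "Vehicle.Class", "Sports Car"))  # [0, 1]
```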
Model Summary
##
## Call:
## lm(formula = Customer.Lifetime.Value ~ Monthly.Premium.Auto +
## I(Coverage == "Premium") + Total.Claim.Amount + Income +
## I(Number.of.Policies == 1) + I(Number.of.Policies == 2) +
## I(EmploymentStatus == "Employed") + I(Policy.Type == "Special Auto") +
## I(Number.of.Open.Complaints == 3) + I(Number.of.Open.Complaints ==
## 4) + I(Number.of.Open.Complaints == 5) + I(Vehicle.Class ==
## "Sports Car"), data = traindf3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8034 -792 12 645 55959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.808e+02 1.724e+02 -5.108 3.33e-07
## Monthly.Premium.Auto 8.362e+01 2.048e+00 40.831 < 2e-16
## I(Coverage == "Premium")TRUE -5.902e+02 1.856e+02 -3.180 0.00148
## Total.Claim.Amount -3.213e-01 2.454e-01 -1.309 0.19042
## Income 3.836e-03 2.712e-03 1.414 0.15731
## I(Number.of.Policies == 1)TRUE -3.623e+03 1.136e+02 -31.889 < 2e-16
## I(Number.of.Policies == 2)TRUE 8.623e+03 1.253e+02 68.841 < 2e-16
## I(EmploymentStatus == "Employed")TRUE 4.383e+02 1.679e+02 2.610 0.00907
## I(Policy.Type == "Special Auto")TRUE 2.220e+02 2.493e+02 0.891 0.37321
## I(Number.of.Open.Complaints == 3)TRUE -6.136e+02 2.877e+02 -2.132 0.03301
## I(Number.of.Open.Complaints == 4)TRUE -7.730e+02 3.960e+02 -1.952 0.05098
## I(Number.of.Open.Complaints == 5)TRUE -1.187e+03 5.954e+02 -1.994 0.04621
## I(Vehicle.Class == "Sports Car")TRUE 3.929e+02 2.246e+02 1.750 0.08023
##
## (Intercept) ***
## Monthly.Premium.Auto ***
## I(Coverage == "Premium")TRUE **
## Total.Claim.Amount
## Income
## I(Number.of.Policies == 1)TRUE ***
## I(Number.of.Policies == 2)TRUE ***
## I(EmploymentStatus == "Employed")TRUE **
## I(Policy.Type == "Special Auto")TRUE
## I(Number.of.Open.Complaints == 3)TRUE *
## I(Number.of.Open.Complaints == 4)TRUE .
## I(Number.of.Open.Complaints == 5)TRUE *
## I(Vehicle.Class == "Sports Car")TRUE .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4192 on 7297 degrees of freedom
## Multiple R-squared: 0.6338, Adjusted R-squared: 0.6332
## F-statistic: 1052 on 12 and 7297 DF, p-value: < 2.2e-16
Train Result
## [1] "MAE : 2077.39229479091"
## [1] "RMSE : 4187.85907289204"
## [1] "MSE : 17538163.6144041"
## [1] "R2 : 0.633774618404111"
## [1] "Adjusted R2 : 0.633222620706447"
Test Result
## [1] "MAE : 2011.60854846471"
## [1] "RMSE : 4001.44979156014"
## [1] "MSE : 16011600.4343767"
## [1] "R2 : 0.639883498095751"
## [1] "Adjusted R2 : 0.637697360391034"
Repeated 10-fold Cross-Validation.
## Linear Regression
##
## 9134 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 8221, 8221, 8220, 8220, 8222, 8220, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4145.278 0.6370282 2057.793
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
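The resampling scheme above (10 folds, repeated 3 times, as in caret's "repeatedcv") can be sketched by generating the fold indices directly. The function below is a hypothetical illustration with a small n, not the report's code.

```python
import random

def repeated_kfold_indices(n, k, repeats, seed=0):
    """Yield (train_idx, test_idx) splits for repeated k-fold cross-validation:
    each repeat reshuffles the data and partitions it into k disjoint folds."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test

splits = list(repeated_kfold_indices(n=20, k=10, repeats=3))
print(len(splits))  # 30 train/test splits in total
```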
Final Model Interpretation:
Null Hypothesis - None of the independent variables is significant for CLV.
Alternate Hypothesis - At least some of the independent variables are significant and affect CLV.
The p-value of the model is less than 0.05, so at least one of the independent variables is significant.
The p-values of Monthly.Premium.Auto, Number.of.Open.Complaints (== 3 and == 5), Number.of.Policies (== 1 and == 2), Coverage (== "Premium") and EmploymentStatus (== "Employed") are less than 0.05, so these individual variables are significant and affect CLV.
The R-squared is 0.6338, which means that 63.38% of the variance in CLV can be explained by our multiple linear regression model.
The adjusted R-squared is 0.6332, slightly below the R-squared, and the best among all the models we built. Although XGBoost gives a higher adjusted R-squared on the training set, it overfits heavily.
The residual standard error is 4192, meaning the actual CLV deviates from the fitted regression line by approximately 4192 on average. The smaller the standard error, the less the spread and the more likely any sample mean is close to the population mean, so a small standard error is desirable. As for the F-statistic: the lower it is, the closer the model is to being non-significant; the high F-statistic of the final model indicates a significant model.
The results of cross-validation also suggest that our final model generalizes well to unseen data points.
The average error rate of the model is about 23.95%, which is reasonably low, so we can say the model performs well.
## [1] 23.95436
Shapiro Test.
Null Hypothesis - Errors are normally distributed.
Alternate Hypothesis - Errors are not normally distributed.
##
## Shapiro-Wilk normality test
##
## data: residualsCLV[0:5000]
## W = 0.6857, p-value < 2.2e-16
Since the p-value < 0.05, the null hypothesis is rejected: the errors are not normally distributed.
Residuals vs Fitted Plot
Here we can observe a recognizable pattern. This suggests either non-linearity or that other attributes have not been adequately captured.
The Q-Q plot compares the distribution of our residuals against the theoretical normal distribution. There is strong snaking away from the diagonal line towards the right, so we should consider the residuals non-normally distributed.
Variance Inflation Factor.
## Monthly.Premium.Auto I(Coverage == "Premium")
## 2.107712 1.174011
## Total.Claim.Amount Income
## 2.155873 2.809679
## I(Number.of.Policies == 1) I(Number.of.Policies == 2)
## 1.232306 1.231919
## I(EmploymentStatus == "Employed") I(Policy.Type == "Special Auto")
## 2.763738 1.001672
## I(Number.of.Open.Complaints == 3) I(Number.of.Open.Complaints == 4)
## 1.001090 1.001641
## I(Number.of.Open.Complaints == 5) I(Vehicle.Class == "Sports Car")
## 1.001868 1.046825
Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. If there is high correlation between two independent variables (high multicollinearity), you cannot separate out the impact of each individual independent variable on the dependent variable. Rather than inspecting the correlation matrix, a better way to assess multicollinearity is to compute the VIF. The smallest possible value is 1, which indicates a complete absence of collinearity; in practice there is typically a small amount of collinearity among the predictors. As a rule of thumb, a VIF value exceeding 5 or 10 indicates a problematic amount of collinearity.
When multicollinearity is present, the complete impact of a single independent variable on the dependent variable cannot be isolated; here all VIF values are below 3, so this is not a major concern.
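For the two-predictor case the VIF reduces to a simple function of the correlation between the predictors, since the R-squared of one regressed on the other is just their squared correlation. A small sketch with made-up columns:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

def vif_two_predictors(x1, x2):
    """With exactly two predictors, VIF = 1 / (1 - r^2) for both of them."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Orthogonal columns give the minimum VIF of 1; near-collinear ones blow up.
orthogonal = vif_two_predictors([1, -1, 1, -1], [1, 1, -1, -1])
collinear = vif_two_predictors([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(orthogonal, collinear)
```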
Breusch-Pagan test.
Null Hypothesis - Homoscedasticity is present in the residuals.
Alternate Hypothesis - Heteroscedasticity is present in the residuals.
##
## studentized Breusch-Pagan test
##
## data: Reg_4
## BP = 940.67, df = 12, p-value < 2.2e-16
Since the p-value < 0.05, we reject the null hypothesis of homoscedasticity: the error terms are heteroscedastic, i.e. they do not have constant variance, which is not good for the model. The Scale-Location plot also suggests heteroscedasticity in our residuals, as it shows a funnel-shaped pattern rather than a random scatter.
These are the Cook's distance and residuals-versus-leverage plots, which help identify influential cases (i.e., subjects), if any. Data points 5717, 7284 and 8826 are influential cases with respect to our regression line; the regression results could change if these cases were excluded.
The data could provide more insights if it had variables like 'expenses caused to the company by the customer', 'initial cost of acquiring a customer', 'mode of contacting the customer' and 'number of purchases made by each customer'.
The data would also be more useful if the time period covered were extended (beyond 2 months).
Churn would be another valuable feature, helping to predict whether a customer is likely to leave the company and, if so, what offers could be made to retain them.
A credit-based insurance score would help characterize each customer and how much monetary value they can provide to the company.
We would like to recommend the following business solutions based on our analysis above.
Team Members: